Retry on leader lease renewal failure #9563
Conversation
Visit the preview URL for this PR (updated for commit 7cf3c2b): https://gloo-edge--pr9563-dont-crash-on-failed-6s4x9piv.web.app (expires Tue, 25 Jun 2024 00:48:42 GMT) 🔥 via Firebase Hosting GitHub Action 🌎 Sign: 77c2b86e287749579b7ff9cadb81e099042ef677
thanks @davidjumani - i've read it over a few times and have a few questions. i think a little more documentation (especially around the go funcs) would be helpful. also, a major question regarding whether we still have the regular behavior if we're not configured to recover from failure.
happy to talk in person if that would be easier
ps: i haven't had a chance to look at the test yet
looking good! I mainly focused on how we're testing for now, and once that is updated, I'm happy to take a closer look at the implementation itself
looks good to me. one question on the comment from Sam regarding whether the goroutine is safe. if so, happy to ignore the nits to approve
kick bulldozer
* retry on leader lease renewal failure
* Die if unable to recover
* use env var
* add basic tests
* update tests
* add comments
* Add comments around ci changes
* update tests
* refactor
* cleanup
* update test
* rename env var
* add changelog
* address comments v1
* address comments v2
* fix test
* use GinkgoHelper
* Adding changelog file to new location
* Deleting changelog file from old location
* specify a duration
* Adding changelog file to new location
* Deleting changelog file from old location
* remove default
* Adding changelog file to new location
* Deleting changelog file from old location
* fix tests
* move changelog
* move counter

---------

Co-authored-by: soloio-bulldozer[bot] <48420018+soloio-bulldozer[bot]@users.noreply.github.com>
Co-authored-by: changelog-bot <changelog-bot>
Co-authored-by: Nathan Fudenberg <nathan.fudenberg@solo.io>
Description
After a container has become the leader, any Kube API server unavailability causes the container to crash, since the leader is unable to renew its lease. This is by design, as outlined here.
However, crashing the gloo pods can also lead to an outage during scaling, since the gateway-proxy pods that come up cannot fetch any config and all routes return 404s.
Now, instead of crashing, the gloo pod falls back to being a follower. This prevents an outage, and any other pod can take over as leader if possible.
This is safe because a leader only writes reports / statuses here and here. On any failure the pod becomes a follower, and if elected leader again it will resume writing reports.
Code changes
Adds a `Reset` method on the identity implementation that allows an identity to fall back to a follower

CI changes
Cilium is installed as the CNI instead, since we need to test Kube API server unavailability
Context
Kube API unavailability results in a gloo container crash
When leader election fails, gloo crashes
Design Doc
Interesting decisions
Testing steps
Kube2e tests to verify the following:
Notes for reviewers
Be sure to verify intended behavior by ...
Please proofread comments on ...
This is a complex PR and may require a huddle to discuss ...
Checklist:
BOT NOTES:
resolves #8107